Abstract



In Information Retrieval (IR), whether implicitly or explicitly, queries 
and documents are often represented as vectors. However, it may be 
more beneficial to consider documents and/or queries as multidimensional 
objects. Our belief is this would allow building "truly" interactive IR 
systems, i.e., where interaction is fully incorporated in the IR framework. 

The probabilistic formalism of quantum physics represents events and 
densities as multidimensional objects. This paper presents our first step 
towards building an interactive IR framework upon this formalism, by 
stating how the first interaction of the retrieval process, when the user 
types a query, can be formalised. Our framework depends on a number 
of parameters affecting the final document ranking. In this paper we
experimentally investigate the effect of these parameters, showing that the
proposed representation of documents and queries as multidimensional objects
can compete with standard approaches, with the additional prospect of being
applied to interactive retrieval.

1 Introduction 

Most information retrieval (IR) models, including probabilistic and vector ones,
use the same underlying one-dimensional representation of documents and
queries, i.e., as vectors defined in a vector space, typically a term space.
However, this representation has some limits when dealing with more complex IR
aspects like interaction, diversity and novelty.¹ Indeed, recent research showed
that these complex aspects of the retrieval process benefit from more
sophisticated representations of documents and queries [15, 13], in particular
those providing for more powerful geometric manipulations of IR components.

The representation of documents and queries in IR should evolve so that user
interaction can be incorporated in a natural and principled way in the IR
process [13]. Our claim is that representing documents and queries as
multidimensional objects (e.g. subspaces in a vector space) allows for not only
a novel but also a more powerful way to tackle this challenge. This
representation is particularly interesting from a theoretical point of view
because it is possible to use a principled interpretation of the probabilities
associated with such multidimensional objects, which comes from quantum
physics [13] - the so-called "quantum probabilities" framework. This
representation is also interesting from an intuitive point of view because it
relies on a geometric representation of documents and queries in a vector
space, which has proved successful in IR [2]. This representation also reveals
a strong connection between orthogonality (in the vector space) and
non-relevance, which has been successfully used to represent term negation in
queries [15].

* This is an extended version of a paper published in RIAO 2010 [8].

1 In our research, we are particularly interested in these aspects of the IR process.

In [10] , a framework for interactive IR that relies on such a multidimensional 
representation of documents and queries was proposed. In this framework, the 
user's information need (IN) is represented by a set of weighted vectors that 
evolve with the user's interaction. A probability of relevance of a document 
(for that IN) is computed with respect to this set. Although the components 
of our framework were described, they remained abstract. In particular, no 
explicit document and query representations were proposed. The next step is 
to operationalise the framework, which is the focus of this paper. We show how 
document and query representations are computed to then allow estimating the 
probability of relevance of the document to a given IN. 

With respect to related work, multidimensional representations of queries were
used in [13] to model negative user feedback, and multidimensional
representations of documents were investigated in [3] in an ad hoc setting.
Our work encompasses those since
it provides a principled and probabilistic way to work with multidimensional 
objects. Finally, two lines of research explored, respectively, a subspace repres- 
entation of documents [7] and of a user's IN 0. In our work, we go further and 
show that both documents and INs can be represented as multidimensional ob- 
jects, and propose a principled methodology to construct these representations. 

The outline of this paper is as follows. We first briefly introduce our
framework and describe how the probability of relevance is computed within the
quantum probability framework (Section 2). Next we show how we construct
the query and document representations, and introduce several parameters for
these representations (Sections 3 and 4). Finally, we present experimental
results, which validate our document and query representations and some of the
investigated parameters, and give insights on how our framework can be further
developed (Section 5).

2 A Quantum-inspired View for IR 

Our IR framework is built upon [10], which is based on quantum probabilities
and where we assume that there exists a vector space of pure² information
needs (INs), where each vector corresponds to an IN that completely
characterises a possible user's IN - by analogy with quantum physics, where a
vector completely characterises a physical system. Knowing a user's pure IN
would determine which documents the IR system should return to that user. From
a geometric perspective, a pure IN is answered by a document with a probability
that depends on the length of the projection of the pure IN vector onto the
document subspace. Because of the uncertainty attached to the IR search
process, we suppose that the information being searched by a user can be
represented by a set of such pure INs, one for each possible pure IN that
composes a user's IN.

2 The concept of "pure" IN is new and central to our framework. In this paper,
we use "pure IN" to distinguish it from "IN", where the latter refers to
information need in its usual sense in IR, e.g., see [6].

To compute a probability of relevance of a document to a user's IN, we make
use of the generalisation of probabilities developed in quantum physics, which
is strongly connected to the geometry of the space used to represent events and
densities. A probabilistic event is represented as a subspace (denoted $S$) in a
Hilbert space.³ Let us assume that $S$ is the event "the document is relevant".
A probability can first be defined for a pure IN, represented as a unit vector
$\varphi$, by computing the length of the projection of the vector $\varphi$
onto the subspace $S$, that is, by computing the value
$\|\widehat{S}\varphi\|^2$ (the squared length of the projection), where
$\widehat{S}$ is the projector onto the subspace $S$. This value is the
probability that the document is relevant with respect to the pure IN.⁴

When a user starts interacting with an IR system by, for instance, typing
a query,⁵ we first compute (see Section 4) an initial set of weighted pure IN
vectors, where each weight is the probability that the pure IN corresponds to
the actual user's IN. This captures the uncertainty typical of IR where, firstly,
the representation is only an approximation of the user's IN and, secondly, the
query may be ambiguous. The goal of an IR system is to reduce this uncertainty
through interaction.

More formally, we assume that each pure IN vector $\varphi_i$ is associated
with a probability $p_i$ (the weight). We define the probability of the event
$S$ by using the usual total probability theorem (across all possible pure INs):⁶

$$\Pr(S) = \sum_i p_i \Pr(S \mid \varphi_i) = \sum_i p_i \, \varphi_i^\top \widehat{S} \varphi_i = \mathrm{tr}(\rho \widehat{S}) \qquad (1)$$

where tr is the trace operator [13, p. 83] and
$\rho = \sum_i p_i \varphi_i \varphi_i^\top$ is called a density operator⁷ and
corresponds to a (probabilistic) mixture of the pure INs $\varphi_i$. In
general, any operator $\rho$ that is both positive semi-definite⁸ and of trace 1
defines a probability distribution over the subspaces, i.e. it is possible to
interpret $\Pr(S) = \mathrm{tr}(\rho \widehat{S})$ as a probability [13].
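
The last equality in Equation (1) can be checked in a few lines. The derivation below is a sketch that writes $\widehat{S}$ for the projector onto $S$ and $\rho$ for the density, and assumes real-valued vectors (so the transpose plays the role of the adjoint). It uses only the linearity and cyclic property of the trace, together with the fact that an orthogonal projector satisfies $\widehat{S} = \widehat{S}^\top = \widehat{S}^2$.

```latex
\begin{align*}
\mathrm{tr}(\rho \widehat{S})
  &= \mathrm{tr}\Big(\sum_i p_i \, \varphi_i \varphi_i^\top \widehat{S}\Big)
   = \sum_i p_i \, \mathrm{tr}\big(\varphi_i \varphi_i^\top \widehat{S}\big) \\
  &= \sum_i p_i \, \mathrm{tr}\big(\varphi_i^\top \widehat{S} \varphi_i\big)
   = \sum_i p_i \, \varphi_i^\top \widehat{S} \varphi_i \\
  &= \sum_i p_i \, \varphi_i^\top \widehat{S}^\top \widehat{S} \varphi_i
   = \sum_i p_i \, \|\widehat{S} \varphi_i\|^2 .
\end{align*}
```

Each term $\|\widehat{S}\varphi_i\|^2$ is the probability of relevance for the pure IN $\varphi_i$, so the trace is the expected value of that probability under the weights $p_i$.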

For each document $d$, we compute a projector $\widehat{S}_d$ (Section 3) and,
for a query $q$, the IN density $\rho$ is approximated by $\rho_q$ (Section 4).
Using the projector $\widehat{S}_d$ and the density $\rho_q$, the probability
that a document $d$ is relevant to the query $q$ is then given by
$\mathrm{tr}(\rho_q \widehat{S}_d)$.
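
To make the trace formula concrete, the following minimal sketch computes a probability of relevance in a toy three-term space. The helper names `projector` and `density`, the axis order (music, chart, hit) and all numeric values are assumptions chosen for illustration; they are not taken from the experiments.

```python
import numpy as np

def projector(basis):
    """Projector onto the span of the given orthonormal column vectors."""
    B = np.asarray(basis, dtype=float)
    return B @ B.T

def density(vectors, probs):
    """Mixture density rho = sum_i p_i * phi_i phi_i^T for unit vectors phi_i."""
    return sum(p * np.outer(v, v) for v, p in zip(vectors, probs))

# Toy term space with axes (music, chart, hit); all values are illustrative.
e_music, e_chart, e_hit = np.eye(3)
S_d = projector(np.column_stack([e_music, e_chart]))  # document subspace

phi_1 = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)  # lies inside the subspace
phi_2 = e_hit                                   # orthogonal to the subspace
rho = density([phi_1, phi_2], [0.7, 0.3])

prob = np.trace(rho @ S_d)
print(round(float(prob), 6))  # 0.7
```

The pure IN inside the subspace contributes its full weight (0.7) and the orthogonal one contributes nothing, which illustrates the link between orthogonality and non-relevance.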

In our work, we assume that the vector space of pure INs is the term space,
where each dimension corresponds to a term. A pure IN is hence described by
a series of weighted terms. A (simplified) example is shown in Figure 1, where
the pure IN "pop music" (one unit vector) is represented by the terms "music",
"chart" and "hit" of the term space. We now show how document and query
representations are computed in this term space.

3 Hilbert spaces (roughly, vector spaces with complex scalars) are a central
mathematical concept in quantum physics.

4 We have $\|\widehat{S}\varphi\|^2 \in [0, 1]$ since $\|\varphi\| = 1$.

5 Queries are what users (usually) provide to an IR system as a means to
express their INs [6].

6 As in quantum physics, we assume different $\varphi_i$ correspond to
different systems and are thus mutually exclusive.

7 We will omit the term "operator" in the remainder of the paper.

8 This means $v^\top \rho v \ge 0$ for any vector $v$.



3 Creating the Document Subspace 

It is reasonable to assume that a typical document answers various (pure)
INs, since it is likely to contain answers (be relevant) to several queries.
Moreover, [11] have shown, in the context of XML retrieval, that answers to
topics (statements of INs) usually correspond to document fragments and not
full documents. Building on this, we assume that for each document there is a
mapping between its (possibly overlapping and non-contiguous) fragments and a
set of pure INs.

[Figure 1: A pure IN in a term space. Axes: Term: Hit, Term: Music, Term: Chart.]

A document is thus associated with a set $U_d$ of vectors in the IN space. We
hypothesise that a document is fully relevant to a pure IN if the latter can be
written as a linear combination of the vectors of $U_d$, that is, if it is
contained in the subspace $S_d$ defined as the span of the vectors in $U_d$.
The document will be partially relevant to a pure IN with a probability that
depends on the length of the projection of the pure IN vector onto the
subspace $S_d$. The subspace $S_d$ can be interpreted as a geometric
representation of the event "the document is relevant". This construction
process was validated in a document filtering task [5]. In this paper, we
investigate the effect of several parameters (written in bold below) on this
process.

Document Fragments. We now assume that document fragments are 
disjoint, and are obtained through a "natural" segmentation of the document. 
Various choices are possible, and our first strategy is to use a single fragment, 
the document itself. This corresponds to the vector space approach where a 
single vector represents a document. The second strategy is to use paragraphs 
as fragments as they seem to be of an appropriate size to correspond to a pure 
IN. We also selected a third type of fragment, the sentence, as it is one of the 
smallest coherent units in a document. 

Weighting Schemes. We now need a vector representation for each frag- 
ment. Three weighting schemes are used, namely, tf-idf, tf and binary (term 
presence/absence). The latter two are chosen since they allow substantial re- 
duction in computational complexity. In addition, binary vectors are close or 
equal to tf vectors for small fragments, for example, sentences. 
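
The three weighting schemes can be sketched as follows. The function name `fragment_vector`, the sparse dictionary representation, the toy idf values and the final normalisation to unit length are assumptions made for this illustration (pure IN vectors are unit vectors in Section 2, but how fragment vectors were normalised is not stated here).

```python
import math
from collections import Counter

def fragment_vector(tokens, scheme, idf=None):
    """Sparse term vector (dict term -> weight) for one fragment.

    scheme is "tf", "binary" or "tf-idf"; idf maps term -> idf value and is
    only needed for "tf-idf". The vector is scaled to unit length (assumption).
    """
    tf = Counter(tokens)
    if scheme == "tf":
        vec = {t: float(c) for t, c in tf.items()}
    elif scheme == "binary":
        vec = {t: 1.0 for t in tf}
    elif scheme == "tf-idf":
        vec = {t: c * idf[t] for t, c in tf.items()}
    else:
        raise ValueError(f"unknown weighting scheme: {scheme}")
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else vec

idf = {"pop": 1.2, "music": 0.8, "chart": 2.0}   # hypothetical idf values
sentence = ["pop", "music", "chart"]             # a short, sentence-sized fragment
tf_vec = fragment_vector(sentence, "tf")
bin_vec = fragment_vector(sentence, "binary")
tfidf_vec = fragment_vector(sentence, "tf-idf", idf)
print(tf_vec == bin_vec)  # True: no term repeats in this fragment
```

For a short fragment in which no term repeats, the tf and binary vectors coincide, which is the situation described above for sentences.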

$U_d$ is formally defined as the set of vectors associated with a document $d$,
obtained through one of the above segmentation and weighting schemes, i.e., we
have one vector for each fragment. As discussed before, we need to compute
the subspace $S_d$ spanned by the vectors of $U_d$. For this, we use an
eigenvalue decomposition where $\sum_{\varphi \in U_d} \varphi \varphi^\top$ is
expressed as $\sum_{i=1}^{D} \lambda_i v_i v_i^\top$, where $D$ is the number of
eigenvectors with non-null eigenvalues ($D$ is also the dimension of the
associated subspace), $\lambda_i > 0$ are the eigenvalues (we suppose without
loss of generality that they are of decreasing magnitude, i.e.
$\lambda_i \ge \lambda_{i+1}$) and the vectors $v_i$ form an orthonormal basis
of the subspace $S_d$ [H].

Dimension selection. As the vectors constructed from the terms occurring
in the document fragments are only an approximation of the underlying pure
IN vectors, the vectors from $U_d$ will contain terms that should not be
associated with the document. We are thus interested in the eigenvectors
associated with the $K$ highest eigenvalues, since low eigenvalues are likely to
be associated with noise [5]. We are interested in measuring the effect of
using different numbers of dimensions to represent a document. Hence, we chose
a simple strategy, where we keep the eigenvectors whose eigenvalue is higher
than the average of the eigenvalues, which we compared to two extreme
strategies, namely, the case where we select the eigenvector with the highest
eigenvalue (one dimension, $K = 1$) and the case where we keep all the
eigenvectors (full dimension, $K = D$).

Finally, the projector $\widehat{S}_d$ associated with the $K$-dimensional
subspace of document $d$ is expressed as $\sum_{i=1}^{K} v_i v_i^\top$.
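
The construction of the document projector, including the three dimension selection strategies, can be sketched with NumPy as follows. The function name `document_projector`, the tolerance used to decide which eigenvalues are non-null, averaging over the non-null eigenvalues only, and keeping at least one dimension under the "mean" strategy are all assumptions of this sketch.

```python
import numpy as np

def document_projector(fragment_vectors, strategy="mean", tol=1e-10):
    """Projector onto the K leading eigenvectors of sum_phi phi phi^T.

    strategy: "highest" (K = 1), "mean" (eigenvalue above the average of the
    non-null eigenvalues, at least one kept) or "all" (K = D).
    """
    U = np.asarray(fragment_vectors, dtype=float)   # one row per fragment
    M = U.T @ U                                     # sum of outer products
    eigvals, eigvecs = np.linalg.eigh(M)            # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]               # reorder: decreasing
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    D = int(np.sum(eigvals > tol))                  # non-null eigenvalues
    if D == 0:
        raise ValueError("all fragment vectors are zero")
    if strategy == "highest":
        K = 1
    elif strategy == "mean":
        K = max(1, int(np.sum(eigvals[:D] > eigvals[:D].mean())))
    elif strategy == "all":
        K = D
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    V = eigvecs[:, :K]                              # orthonormal basis of S_d
    return V @ V.T

# Three toy fragments in a 3-term space; two of them are identical.
fragments = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
S_all = document_projector(fragments, "all")
S_mean = document_projector(fragments, "mean")
print(int(round(np.trace(S_all))), int(round(np.trace(S_mean))))  # 2 1
```

The trace of a projector equals the dimension of its subspace, so the printed values are the $K$ selected by each strategy for this toy document.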

4 Creating the Query Density 

We now focus on the primary contribution of the paper, namely, the construction
of the IN density $\rho_q$ for a given query $q$.

As a query in its simplest form consists of a set of terms, we are first
interested in building the query representation for a query composed of a
single term, $t$. We described how a document is represented as a set of pure
IN vectors corresponding to different fragments of the document. We extend this
idea, and suppose that a query term $t$ can be represented as the set $U_t$ of
pure IN vectors that correspond to document fragments containing the term $t$.
That is, we use the immediate surroundings of the term occurrences in the
documents of the collection being searched to build that term representation.
This is similar to pseudo-relevance feedback using passages from retrieved
documents containing the query terms [1]. The difference is that we use all the
passages to build the query representation, as we want to consider all possible
pure INs associated with the term $t$.

As we have a priori no way to distinguish between the different vectors in
$U_t$, we assume that each vector is equally likely to be a pure IN composing
the user's actual IN. Hence, a document is relevant to the user's IN if it is
relevant to any of the vectors of $U_t$, where the vectors are drawn with a
uniform probability. The corresponding density is then written as:

$$\rho_t = \frac{1}{N_t} \sum_{\varphi \in U_t} \varphi \varphi^\top \qquad (2)$$

where $N_t$ is the number of vectors associated with term $t$ (the cardinality
of $U_t$). This definition of $\rho_t$ has all the required properties of a
density (see Section 2). In practice, this representation of a single-term
query $t$ means that the more vectors $\varphi$ from $U_t$ lie in the document
subspace, the higher the relevance of the document to the query. This query
representation hence favours documents containing different "aspects" of the
IN, each of them as represented by one of the pure INs in $U_t$ associated with
a query term $t$.
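
As an illustration of the single-term density, the sketch below builds $\rho_t$ as the uniform mixture of the vectors associated with a term and shows that a document subspace containing more of those vectors receives a higher probability. The function name `term_density` and the toy vectors are assumptions made for this example.

```python
import numpy as np

def term_density(U_t):
    """rho_t = (1 / N_t) * sum of phi phi^T over the unit vectors phi in U_t."""
    U = np.asarray(U_t, dtype=float)                   # one row per vector
    U = U / np.linalg.norm(U, axis=1, keepdims=True)   # enforce unit length
    return (U.T @ U) / len(U)

# Toy set U_t: three fragments containing term t, each a different "aspect".
rho_t = term_density([[1, 0, 0], [0, 1, 0], [0, 0, 1]])

S_one = np.diag([1.0, 0.0, 0.0])   # document subspace covering one aspect
S_two = np.diag([1.0, 1.0, 0.0])   # document subspace covering two aspects
p_one = float(np.trace(rho_t @ S_one))
p_two = float(np.trace(rho_t @ S_two))
print(p_one < p_two)  # True
```

The density has trace 1 and is positive semi-definite by construction, since it is an average of outer products of unit vectors.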

We discuss next the representation of a query composed of several terms. 
There are three main parameters (written in bold below). 




[Figure 2: Combining INs. (a) A superposition of two INs. (b) A mixture of two INs.]



Weighting scheme. As for documents, three weighting schemes, namely,
tf-idf, tf and binary, are used to build the vectors forming $U_t$.

Query construction (mixture). The above query representation (Equation (2))
can be generalised to a query composed of several terms. We assume that
a relevant document should equally answer all pure INs associated with each
query term. To compute the probability of relevance of a document $d$, we first
select a term from the query (with a probability $w_t$, see the next paragraph),
and then one of the vectors in $U_t$. With this vector, we compute the
probability of document $d$ being relevant to this pure IN. We repeat the
process and average over all the possible combinations. This defines the
probability of relevance of document $d$ given the query. Formally, this
corresponds to a density defined as a mixture of all the pure IN vectors
associated with the query terms. This density is built from the individual
query term densities $\rho_t$ (Equation (2)):

$$\rho_q = \sum_{t \in q} w_t \, \rho_t \qquad (3)$$

Query term weight. The weights $w_t$ are used to quantify the importance
of each term $t$ of the query. We experimented with two settings, one where all
the $w_t$ were equal, and the other where they were set to the corresponding
term idf values. In both approaches, we normalise the weights so their sum
equals 1.
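
The mixture construction with normalised term weights can be sketched as below. The function name `mixture_density`, the toy term densities and the idf values are hypothetical and serve only to show the two weight settings (uniform and idf).

```python
import numpy as np

def mixture_density(term_densities, weights):
    """rho_q = sum_t w_t * rho_t, with the weights normalised to sum to 1."""
    terms = list(term_densities)
    w = np.array([weights[t] for t in terms], dtype=float)
    w = w / w.sum()
    return sum(wt * term_densities[t] for wt, t in zip(w, terms))

# Hypothetical pre-computed term densities in a 3-dimensional IN space.
densities = {
    "pizza": np.diag([1.0, 0.0, 0.0]),
    "cambridge": np.diag([0.0, 0.5, 0.5]),
}
idf = {"pizza": 2.0, "cambridge": 6.0}        # hypothetical idf values
uniform = {"pizza": 1.0, "cambridge": 1.0}

rho_idf = mixture_density(densities, idf)
rho_uniform = mixture_density(densities, uniform)
print(np.diag(rho_idf), np.diag(rho_uniform))
```

Because each term density has trace 1 and the weights sum to 1, the resulting query density also has trace 1 under either weighting.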

We present a second query construction process, inspired by IR and quantum
theory. In vector-space IR, a query is represented by a vector that corresponds
to a linear combination of the vectors associated with the query terms. In
quantum theory, a normalised linear combination corresponds to the principle of
superposition, where the descriptions of system states can be superposed to
describe a new system state.

In our case, the system state corresponds to the user's pure IN, and we
use the superposition principle to build new pure INs from existing ones, as
illustrated with the example shown in Figure 2. Let $\varphi_p$,
$\varphi_{c/uk}$ and $\varphi_{c/usa}$ be three vectors in a three-dimensional
IN space that, respectively, represent the INs "I want a pizza", "I want it to
be delivered in Cambridge (UK)" and "I want it to be delivered in Cambridge
(USA)". The pure IN vector "Pizza delivered in Cambridge (UK)" would be
represented by a (normalised) linear combination (or superposition) of
$\varphi_p$ and $\varphi_{c/uk}$, as depicted in Figure 2(a). We can similarly
build the IN for Cambridge (USA). To represent the ambiguous query "pizza
delivery in Cambridge" where we do not know whether Cambridge is in the USA
or the UK, and assuming there is no other source of ambiguity, we would use a
mixture of the two possible superposed INs, as depicted by the two vectors of
the mixture in Figure 2(b), which brings us to another variant of query
construction, the mixture of superpositions.

Query construction (mixture of superpositions). To compute the
probability of relevance, for each term $t$ of the query, we randomly select a
vector from the set $U_t$. We then superpose (i.e., compute a linear
combination of) the selected vectors (one for each term), where the weight in
the linear combination is $\sqrt{w_t}$ (see below for why we use a square
root). From this vector, we compute the probability of the document being
relevant to this IN made from the superposition of IN vectors (one per query
term). With respect to our example, the set $U_{\mathrm{pizza}}$ would contain
just one vector ("I want a pizza to be delivered"), and
$U_{\mathrm{Cambridge}}$ would contain two vectors (one for UK, one for USA).

As with the simple mixture approach, the above process can be repeated for 
all the possible selections of vectors and the corresponding query density is: 



where $Z_q$ is a normalisation coefficient, and $t_i$ ($i = 1 \ldots n$) are
the $n$ query terms. We use $N_t$ to ensure that each term contribution is
equally important, and square roots because both $N_t$ and $w_t$ appear twice
in the above formula.
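
As a sketch only, one way to write a density that matches this description is given below. It assumes that the vector selected for term $t_i$ enters the superposition with coefficient $\sqrt{w_{t_i}/N_{t_i}}$ and that $Z_q$ is the trace of the unnormalised sum, so that the density has trace 1. Both choices are assumptions made for illustration and should not be read as the exact formula of Equation (4).

```latex
\rho_q \;=\; \frac{1}{Z_q}
  \sum_{\varphi_1 \in U_{t_1}} \cdots \sum_{\varphi_n \in U_{t_n}}
  \Big( \sum_{i=1}^{n} \sqrt{\tfrac{w_{t_i}}{N_{t_i}}} \, \varphi_i \Big)
  \Big( \sum_{i=1}^{n} \sqrt{\tfrac{w_{t_i}}{N_{t_i}}} \, \varphi_i \Big)^{\top}
```

Under these assumptions, a one-term query ($n = 1$, $w_{t_1} = 1$) reduces to $\frac{1}{N_t} \sum_{\varphi \in U_t} \varphi \varphi^\top$, i.e. the single-term density of Equation (2).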



Note that for one-term queries, the two described query constructions (mixture
and mixture of superpositions) give the same result. Another important
point from a computational perspective is that in both cases, the query density
can be estimated from single term densities (not demonstrated for
Equation (4)). We hence pre-compute the densities $\rho_t$ for each term $t$,
and use them at query time to compute $\rho_q$ and $\rho_q \widehat{S}_d$.
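
The mixture of superpositions can be illustrated by brute-force enumeration over every selection of one vector per query term. The sketch below is not the pre-computation scheme mentioned above: it is exponential in the number of terms and is meant only for toy inputs. The function name, the coefficient $\sqrt{w_t/N_t}$ given to each selected vector and the normalisation by the trace are assumptions of this example.

```python
import itertools
import numpy as np

def mixture_of_superpositions(term_vectors, weights):
    """Brute-force density over every selection of one vector per query term.

    term_vectors: dict term -> list of unit vectors (the set U_t).
    weights: dict term -> w_t.
    Assumption: each selected vector gets coefficient sqrt(w_t / N_t), and
    the result is divided by its trace (playing the role of Z_q).
    """
    terms = list(term_vectors)
    coeff = {t: np.sqrt(weights[t] / len(term_vectors[t])) for t in terms}
    dim = len(term_vectors[terms[0]][0])
    rho = np.zeros((dim, dim))
    for selection in itertools.product(*(term_vectors[t] for t in terms)):
        # Superpose the selected vectors (one per term) into a single vector.
        psi = sum(coeff[t] * np.asarray(v, dtype=float)
                  for t, v in zip(terms, selection))
        rho += np.outer(psi, psi)
    return rho / np.trace(rho)

# Pizza example: one vector for "pizza", two for "cambridge" (UK and USA).
e_p, e_uk, e_usa = np.eye(3)
U = {"pizza": [e_p], "cambridge": [e_uk, e_usa]}
rho_q = mixture_of_superpositions(U, {"pizza": 0.5, "cambridge": 0.5})

# For a one-term query the result coincides with the single-term density.
rho_single = mixture_of_superpositions({"cambridge": [e_uk, e_usa]},
                                       {"cambridge": 1.0})
print(np.allclose(rho_single, np.diag([0.0, 0.5, 0.5])))  # True
```

Unlike the simple mixture, the resulting density has non-zero off-diagonal entries between the "pizza" and "cambridge" dimensions, which is what rewards documents that answer the superposed IN.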

Dimension selection. As for the representation of documents, both densities
are expressed, through eigenvalue decomposition, as a sum
$\sum_{i=1}^{D} \lambda_i v_i v_i^\top$, where the $(\lambda_i, v_i)$ are
eigenpairs ordered by decreasing eigenvalues. Our final density used for
computing the probability of relevance is then
$\rho_q = \sum_{i=1}^{K} \lambda_i v_i v_i^\top$, where $K$ is the selected
dimension (with $K \le D$). We use the same three strategies to set $K$ that
were used for the document representations (see end of Section 3).

5 Experiments and Analysis 

In previous work [5], we validated the subspace document representation on
a filtering task. In this paper, we explore both the document and the query
representations in an ad hoc retrieval task. In particular, we look at the
effects of the parameters discussed in Sections 3 and 4. These are listed in
the left column of Table 1. As the parameters are mostly independent of each
other, we experimented with 756 settings; those not making sense were
ignored.¹⁰

9 The effect will be to give higher importance to superpositions of vectors
$\varphi_i$ that are similar, i.e., whose cosine is closer to 1.

Parameters                                  Means
(1) Document fragment                       sentence (0.14) >> paragraph (0.12) >> document (0.11)
(2) Weighting scheme (document fragment)    tf (0.13) >> tf-idf (0.12), binary (0.12)
(3) Weighting scheme (query)                tf-idf (0.13) >> tf (0.12), tf-idf > binary (0.12)
(4) Dimension selection (document)          all (0.14) >> highest (0.11), mean (0.14) >> highest (0.11)
(5) Dimension selection (query)             all (0.13), mean (0.13), highest (0.12)
(6) Term weight in query                    idf (0.13) >> uniform (0.12)
(7) Query construction                      mixture (0.13), mixture of superpositions (0.13)

Table 1: Means of medians of average precision for each topic. The ">" (resp.
">>") sign is used to denote statistical significance at 0.05 (resp. 0.01).
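
The figure of 756 settings can be reproduced by enumerating the parameter grid and dropping the combinations that make no sense. The sketch below assumes that the only ignored combinations are those where the whole document is the fragment and a document dimension selection other than a single dimension is requested; the option labels are taken from Table 1, and the variable names are hypothetical.

```python
import itertools

fragments = ["document", "paragraph", "sentence"]
weightings = ["tf-idf", "tf", "binary"]
dimensions = ["highest", "mean", "all"]
term_weights = ["uniform", "idf"]
constructions = ["mixture", "mixture of superpositions"]

settings = []
for frag, doc_w, query_w, doc_dim, query_dim, tw, qc in itertools.product(
        fragments, weightings, weightings, dimensions, dimensions,
        term_weights, constructions):
    # A whole-document fragment gives a one-dimensional subspace, so only one
    # document dimension setting is kept for it (assumption of this sketch).
    if frag == "document" and doc_dim != "highest":
        continue
    settings.append((frag, doc_w, query_w, doc_dim, query_dim, tw, qc))

print(len(settings))  # 756
```

The full grid has 3 x 3 x 3 x 3 x 3 x 2 x 2 = 972 combinations, of which 216 are removed by the rule above.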

We used the INEX 2008 collection in our experiments because its documents 
have markup (in XML format) delineating text units. The collection consists of 
659,388 Wikipedia documents in XML format, using tags such as article, section 
and paragraph to model a document's logical structure [4]. INEX 2008 has 70
assessed topics, and for each topic, relevant passages in (pooled) documents were
highlighted by human assessors. A document containing a relevant passage is
assumed relevant, which is in accordance with TREC guidelines.

We preprocessed the documents by extracting the fragments, i.e., the whole
document, the paragraphs (as determined by the XML markup) and the
sentences.¹¹ We then stemmed and stopped (using the SMART list of stop-words)
the text fragments. For each term $t$, we computed an approximation of the
term density $\rho_t$ (Equation (2)) based on a sample of 10,000 documents
(maximum) containing the term $t$ and using a thin eigenvalue decomposition
with maximum rank set to 10 [12, pp. 171-181]. This value, chosen through
experimentation, represents a good trade-off between complexity and
efficiency. For each query $q$, we computed the query density $\rho_q$ using
the densities $\rho_t$ of its composing terms $t$, using either the simple
mixture (Equation (3)) or the mixture of superpositions (Equation (4)). We then
first retrieved a set of 1,500 documents using BM25¹² [TJ]. For each retrieved
document $d$ and each parameter setting, we computed the projector
$\widehat{S}_d$ and a probability of relevance
$\mathrm{tr}(\rho_q \widehat{S}_d)$. We used this value to re-rank the
documents.
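
The re-ranking step can be sketched as follows, assuming the query density and the projectors of the candidate documents have already been computed. The function name `rerank`, the document identifiers and the toy matrices are hypothetical.

```python
import numpy as np

def rerank(rho_q, projectors):
    """Sort candidate document ids by decreasing tr(rho_q @ S_d)."""
    scores = {doc_id: float(np.trace(rho_q @ S_d))
              for doc_id, S_d in projectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy query density and projectors for three candidate documents.
rho_q = np.diag([0.6, 0.3, 0.1])
candidates = {
    "d1": np.diag([0.0, 0.0, 1.0]),
    "d2": np.diag([1.0, 1.0, 0.0]),
    "d3": np.diag([1.0, 0.0, 0.0]),
}
ranking = rerank(rho_q, candidates)
print([doc_id for doc_id, _ in ranking])  # ['d2', 'd3', 'd1']
```

In this toy case the two-dimensional document subspace of d2 captures most of the query density and is ranked first.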

Table 1 shows our results. For each parameter (left column), we show in
the right column the means of the medians of average precision computed for
the different settings of that parameter. For example in row (1), when the
fragment is "sentence", this value is 0.14. To compare two settings, say
"sentence" vs. "paragraph", we performed a paired t-test where each pair of
samples corresponds to the same topic and same parameter values (weighting
scheme, dimension selection, query term weight, query construction) except for
the document fragment setting. For this example, the result shows that using
sentence fragments outperformed paragraph fragments at a 0.01 significance
level. We discuss each result next.

10 When using a whole document as fragment, the document subspace is
one-dimensional and in this case there is no point in investigating the
dimension selection parameter.

11 We use http://www.andy-roberts.net/software/jTokeniser/index.html for this.

12 With the standard parameter values.

Mixture of superpositions                   Mixture
ΔAP    Topic                                ΔAP    Topic
0.22   social networks mining               0.32   "records management" metadata
0.19   virtual museums                      0.16   Tata Motors Company in India
0.10   genetically modified food safety     0.15   Nikola Tesla inventions patents
0.09   wikipedia vandalism                  0.08   vodka producing countries
0.06   flower meaning                       0.08   mahler symphony song

Table 2: Top five performing topics using, respectively, mixture of superpositions
(Equation (4)) and mixture (Equation (3)) as query representation.
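
The paired comparison between two settings of a parameter can be sketched with the standard library alone. The function below computes the paired t statistic and its degrees of freedom; the average precision values are made-up toy numbers, not results from the experiments, and the significance level would still have to be read from a t distribution table or a statistics package.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for paired samples a[i], b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical average precision values; each position is one
# (topic, other-parameter-values) pair evaluated under both settings.
sentence_ap = [0.20, 0.15, 0.10, 0.30, 0.25]
paragraph_ap = [0.18, 0.12, 0.11, 0.25, 0.21]

t, dof = paired_t_statistic(sentence_ap, paragraph_ap)
print(round(t, 2), dof)
```

Pairing by topic and by the remaining parameter values removes the variance due to those factors, so the test isolates the effect of the one parameter being compared.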

For the document fragment parameter (1), the best performing setting was 
with "sentence" followed by "paragraph" and "document". Each time the differ- 
ence was found to be significant at a 0.01 level. This indicates that the right 
level of segmentation (to construct the pure IN vectors) is at sentence level. 

Overall, the weighting scheme for document fragments and queries had some
effect on retrieval effectiveness. For building the query term density (3), the
tf-idf scheme led to significantly better results, whereas for document
fragments (2), the tf scheme performed better. These results are somewhat in
contradiction with vector-space IR findings, but might stem from the fact that,
to build the query term representation, we sample many more vectors than for
the document one; hence in the former case it is important to weight terms
according to their importance (idf). Looking at the results in more detail, we
also found that the best weighting scheme was highly dependent on the other
parameters, and should hence be chosen depending on them.

The setting of the subspace dimension has a different effect on documents and 
queries. For documents (4), performance was improved using the full dimension 
or the mean of the eigenvalues (to determine the dimension of the subspace 
representation). This shows that using more than one dimension to represent 
a document is beneficial. However, for queries (5) we observe only a slight 
improvement when using multiple dimensions (none of which were significant). 

For the query construction methodology, we first see that weighting the query
terms by their idf values outperformed using a uniform scheme (6). When
looking at mixture vs. mixture of superpositions (7), no significant overall
performance difference exists. However, we observe different behaviours
depending on the topic. Table 2 shows the best performing topics for,
respectively, the mixture of superpositions and the mixture. The topics better
handled by the mixture of superpositions are topics for which the terms form a
"concept", for example "social networks mining", where the three terms together
have a specific meaning. The mixture performed better on topics for which each
term reflects a different aspect of the topic, e.g. '"records management"
metadata', where "metadata" and "records management" are two different
concepts. This indicates that selecting the query density computation according
to the topic may prove beneficial.

The above example suggests that it may be beneficial to treat parts of 
the query differently by combining both construction methods into one query. 
For example, the terms "records" and "management" form a single aspect and 
should thus be superposed. Afterwards, the superposed terms should be mixed
with "metadata", which describes another aspect, to answer the query
'"records management" metadata'. In general, to determine which terms form a
single concept, we can rely on explicit markers, such as the quotes in this
example, or on an automatic algorithm based, e.g., on co-occurrences.
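
A minimal sketch of such a combined construction is given below: the vectors of the terms forming one concept are superposed, and the resulting density is then mixed with the density of the remaining term. This is only an illustration of the idea, not an evaluated method; the function names, the equal-weight superposition, the equal mixture weights and the toy vectors are all assumptions.

```python
import itertools
import numpy as np

def superposition_density(vector_sets):
    """Density over all equal-weight superpositions of one vector per set."""
    dim = len(vector_sets[0][0])
    rho = np.zeros((dim, dim))
    for selection in itertools.product(*vector_sets):
        psi = np.sum(np.asarray(selection, dtype=float), axis=0)
        psi = psi / np.linalg.norm(psi)     # normalised linear combination
        rho += np.outer(psi, psi)
    return rho / np.trace(rho)

def mix(densities, weights):
    """Weighted mixture of densities, with weights normalised to sum to 1."""
    w = np.asarray(weights, dtype=float) / np.sum(weights)
    return sum(wi * rho for wi, rho in zip(w, densities))

# Toy 4-dimensional IN space with hypothetical vector sets per term.
e1, e2, e3, e4 = np.eye(4)
U_records, U_management, U_metadata = [e1], [e2], [e3, e4]

rho_concept = superposition_density([U_records, U_management])  # one aspect
rho_metadata = superposition_density([U_metadata])              # other aspect
rho_q = mix([rho_concept, rho_metadata], [0.5, 0.5])
print(round(float(np.trace(rho_q)), 6))  # 1.0
```

The off-diagonal entry between the "records" and "management" dimensions is non-zero, while the entries linking them to the "metadata" dimensions stay at zero, reflecting that only the first two terms are treated as one concept.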

We also compared our results to 
a state-of-the-art retrieval IR system, 
namely BM25 U\. We found that the 
performance of our framework was consistently lower on average (using
standard IR evaluation metrics). A brief analysis (not reported here) comparing
the results of the best performing configurations with BM25 for the topics in
Table 2 reveals that we could get closer to BM25 performance by (again)
choosing the right query construction methodology (mixture vs. mixture of
superpositions).

Finally, we investigated the effect 
of query length (number of terms) and 
the number of relevant documents (of a 
query) on retrieval effectiveness. No cor- 
relation was found between the difference 
in performance between BM25 and our 
framework, and the number of relevant 
documents. There was however a strong 
dependency on the query length. As illustrated in Figure 3, when the query
length is one (there is no difference between the two query density
construction methods), our approach consistently outperforms BM25; when the
number of terms in the query increases, retrieval performance drops. This
further confirms that the appropriate calculation of the query density - in
particular for multi-term queries - needs to be investigated.

6 Conclusion and Future Work 

In this paper, we presented a methodology to build multidimensional
representations of documents and queries. These representations are inspired by
the geometric/probabilistic framework of quantum physics. The latter allows us
to compute probabilities of relevance based on a more complex representation of
documents than a simple bag of words, namely, a multidimensional one based on
document fragments. We believe that such a multidimensional representation
is key to a successful framework for exploiting user interaction [13].

We performed experiments to explore various parameters influencing the
effectiveness of our representations. We showed that using more than one
dimension to represent documents improves performance, confirming previous
results. Considering a document as a single fragment, as done in most classical
models, is not sufficient to distinguish between the different information
needs a document covers. Indeed, while most of the classical models only take
the mere occurrence of a term into account, we showed in our experiments that
the vicinity of terms (the fact that they appear in the same fragment) plays an
important role.

[Figure 3: Boxplot of the effect of query length (number of terms) on average
precision. The x-axis is the query length (number of terms) and the y-axis is
the difference in average precision between BM25 and our method in different
settings.]

We also explored two different and principled ways to construct the query 
representation. We have shown that queries whose terms define a concept and 
those whose terms are more independent are better handled by two different 
methods, respectively, the mixture of superpositions and the (simple) mixture. 
This suggests that we can gain further improvements if both strategies are ap- 
plied together in an adaptive manner. This is part of our future work. 

As our representation of queries and documents aims at tackling interactive
IR, this work validates our framework for the most common first interaction
step between a user and an IR system - a user typing a query. Exploiting further
interaction steps (for example viewing or saving a document) is also part of our
future work.